EXPLOITING UNTAPPED 
INFORMATION RESOURCES IN 
EARTH SCIENCE 


Rahul Ramachandran NASA/MSFC, Peter Fox RPI, Steve 
Kempler NASA/GSFC and Manil Maskey UAH 


MSFC/UAH: Patrick Gatlin, Xiang Li, Amanda 
Weigel, JJ Miller, Kaylin Bugbee, Ajinkya Kulkarni 

GSFC: Chris Lynnes, Suhung Shen, Chung-Lin Shie, 
Maksym Petrenko 

RPI: Stefan Zednik, Anirudh Prabhu 




Outline 


1. Project Overview 

2. Data Curation Service 

3. Rules Engine 

4. Image Retrieval Service 

5. Summary 



Part 1: Project Overview 

• • • 

Motivation 


• Data preparation steps are cumbersome and 
time consuming 

o Covers discovery, access and preprocessing 


• Limitations of current Data/Information Systems 

o Boolean search on data based on instrument or 
geophysical or other keywords 

o Underlying assumption that users have sufficient 
knowledge of the domain vocabulary 

o Lack of support for those unfamiliar with the domain 
vocabulary or the breadth of relevant data available 


Earth Science Metadata: 
Dark Resources 


• Dark resources - information resources that organizations 
collect, process, and store for regular business or operational 
activities but fail to utilize for other purposes 

o Challenge is to recognize, identify and effectively utilize these 
dark data stores 

• Metadata catalogs contain dark resources consisting of 
structured information, free form descriptions of data and 
browse images. 

o EOS Clearing House (ECHO) holds >6000 data collections, 127 
million records for individual files and 67 million browse images. 


Premise: Metadata catalogs can be utilized beyond their original 
design intent to provide new data discovery and exploration 
pathways to support science and education communities. 


Goals 


• Design a Semantic Middleware 
Layer (SML) to exploit these 
metadata resources 

o provide novel data discovery and 
exploration capabilities that 
significantly reduce data 
preparation time. 

o utilize a varied set of semantic 
web, information retrieval and 
image mining technologies. 

• Design SML as a Service-Oriented 
Architecture (SOA) to allow 
individual components to be 
used by existing systems 

Science Use Cases 


• Dust storms, Volcanic Eruptions, Tropical Storms/ 
Hurricanes 

• Volcanic Eruptions: 

o Emit a variety of gases as well as volcanic ash, which are in 
turn affected by atmospheric conditions such as winds. 

o Role of Components 

• Image Retrieval Service is used to find volcanic ash 
events in browse imagery 

• Data Curation Service suggests the relevant datasets to 
support event analysis 

• Rules Engine invokes a Giovanni processing workflow to 
assemble and compare the wind, aerosol and SO2 
data for the event 

Find Events: Browse Images 




Band 1-4-3 (true color) Band 7-2-1 

Example: MODIS-Aqua 2008- 



Chaiten Volcano Eruption 
Eruption Time period: May 2 - Nov 2008 
Location: Andes region, Chile ( -42.832778, 
-72.645833) 





Suggest Relevant Data 

Total SO2 mass: 

e.g. Chaiten is 10 kt (kt = kilotons; 1 kt = 1000 metric tons) 
ftp://measures.gsfc.nasa.gov/data/s4pa/SO2/MSVOLSO2L4.1/MSVOLSO2L4_v01-00-2014m1002.txt 

Daily SO2: 

OMI/Aura Sulphur Dioxide (SO2) Total Column Daily L2 Global 0.125 deg 
http://disc.sci.gsfc.nasa.gov/datacollection/OMSO2G_V003.html 

Calibrated Radiances: 

MODIS/Aqua Calibrated Radiances 5-Min L1B Swath 1km 
http://dx.doi.org/10.5067/MODIS/MYD021KM.006 

Aerosol Optical Thickness: 

MODIS/Aqua Aerosol 5-Min L2 Swath 10km 
http://modis-atmos.gsfc.nasa.gov/MOD04_L2/ 

SeaWiFS Deep Blue Aerosol Optical Depth and Angstrom Exponent Level 2 
Data 13.5km 
http://disc.gsfc.nasa.gov/datacollection/SWDB_L2_V004.shtml 

IR Brightness Temperature: 

NCEP/CPC 4-km Global (60 deg N - 60 deg S) Merged IR Brightness 
Temperature Dataset 

Generate Giovanni SO2 Plots 

MODIS-Aqua 2008-05-03 18:45 UTC    MODIS-Aqua 2008-05-05 18:30 UTC 

I2G.003 SO2 Column 
ftp://gdata2.sci.gsfc.nasa.gov/daac-bin/G3/gui.cgi?instance_id=omil2g 


Generate Giovanni Infrared Data Plot 

MODIS-Aqua 2008-05-03 18:45 UTC    MODIS-Aqua 2008-05-05 18:30 UTC 

Global Merged IR (00min 18Z03MAY2008) 
Created by NASA Goddard GES DISC 

Global Merged IR (00min 17Z05MAY2008) 
Created by NASA Goddard GES DISC 

Conceptual Model 

• Phenomena 
o Event type 

• Physical Feature 
o Manifestation / driver of phenomena 
o Has space/time extent 
o Can precede or linger after what is generally thought of 
as the phenomena event 

• Observable Property 
o Characteristic/property of physical feature 

• Data Variable 
o Measurement/estimation of observable feature 

Examples 

Phenomena 
• Hurricane 
• Tropical Storm 
• Dust Storm 
• Volcanic Eruption 

Physical Feature 
• Ash Plume 
• Area of High Winds 
• Area of Elevated Surface Temperature 
• Area of High Particulate Emissions 

Observable Property 
• Temperature 
• Radiance 
• Wind Speed 
• Rain Rate 

Data Variable 
• MOD04_L2:Optical_Depth_Land_and_Ocean_Mean 
• MOD02HKM:bands 1, 3, and 4 
• OMSO2e:ColumnAmountSO2_PBL 
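
The four-level conceptual model (phenomenon, physical feature, observable property, data variable) can be sketched as nested records. This is a minimal illustration, not the project's actual schema: the class names and the `data_variables` helper are assumptions, and the link from Ash Plume to Radiance to MOD02HKM is a hypothetical pairing of items taken from the example lists.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataVariable:
    """A measurement/estimate, identified as dataset:variable."""
    dataset: str
    name: str


@dataclass
class ObservableProperty:
    """A characteristic of a physical feature, e.g. radiance."""
    name: str
    variables: List[DataVariable] = field(default_factory=list)


@dataclass
class PhysicalFeature:
    """A manifestation or driver of a phenomenon, e.g. an ash plume."""
    name: str
    properties: List[ObservableProperty] = field(default_factory=list)


@dataclass
class Phenomenon:
    """An event type, e.g. a volcanic eruption."""
    name: str
    features: List[PhysicalFeature] = field(default_factory=list)

    def data_variables(self) -> List[DataVariable]:
        """Collect variables by walking feature -> property -> variable."""
        return [
            var
            for feature in self.features
            for prop in feature.properties
            for var in prop.variables
        ]


# Hypothetical linkage, for illustration only.
radiance = ObservableProperty(
    "Radiance", [DataVariable("MOD02HKM", "bands 1, 3, and 4")]
)
ash_plume = PhysicalFeature("Ash Plume", [radiance])
eruption = Phenomenon("Volcanic Eruption", [ash_plume])

for var in eruption.data_variables():
    print(f"{var.dataset}:{var.name}")
```

Keeping the levels separate means one observable property can be attached to several features without duplicating its data variables.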









Part 2: Data Curation 

Algorithm for Phenomena 

• • • 

Initial Results 

Objectives 

• Design a data curation (relevancy ranking) 
algorithm for a set of phenomena 

• Provide the data curation algorithm as a stand-alone 
service 

• Envisioned Use: 

o Given a phenomenon type (e.g. Hurricane), DCS returns a 
list of relevant data sets (variables) 

• <list of data sets> = DCS(Phenomenon Type) 

o For a specific phenomenon instance (event: Hurricane 
Katrina), these curated datasets can then be filtered 
based on space/time to get actual granules 
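
The second step of the envisioned use, narrowing curated datasets to granules by space and time, might look like the sketch below. Everything here is an assumption made for illustration: the `Granule` record, the function names, the dataset identifiers, and the event window and bounding box are hypothetical, and the rectangle test ignores boxes that cross the antimeridian.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Set, Tuple

BBox = Tuple[float, float, float, float]  # (west, south, east, north), degrees


@dataclass
class Granule:
    dataset: str
    start: datetime
    end: datetime
    bbox: BBox


def overlaps_time(g: Granule, start: datetime, end: datetime) -> bool:
    """True when the granule's time range intersects [start, end]."""
    return g.start <= end and g.end >= start


def overlaps_space(g: Granule, bbox: BBox) -> bool:
    """Simple rectangle intersection; no antimeridian handling."""
    west, south, east, north = bbox
    g_west, g_south, g_east, g_north = g.bbox
    return (g_west <= east and g_east >= west
            and g_south <= north and g_north >= south)


def filter_granules(granules: Iterable[Granule], curated: Set[str],
                    start: datetime, end: datetime, bbox: BBox) -> List[Granule]:
    """Keep granules of curated datasets that intersect the event in space and time."""
    return [
        g for g in granules
        if g.dataset in curated
        and overlaps_time(g, start, end)
        and overlaps_space(g, bbox)
    ]


# Hypothetical inputs: dataset names, event window and box are illustrative.
curated_datasets = {"DATASET_A", "DATASET_B"}
event_start, event_end = datetime(2005, 8, 23), datetime(2005, 8, 31)
event_bbox = (-98.0, 18.0, -80.0, 31.0)

granules = [
    Granule("DATASET_A", datetime(2005, 8, 28), datetime(2005, 8, 28, 0, 5), (-92.0, 24.0, -85.0, 30.0)),
    Granule("DATASET_A", datetime(2005, 9, 15), datetime(2005, 9, 15, 0, 5), (-92.0, 24.0, -85.0, 30.0)),
    Granule("DATASET_C", datetime(2005, 8, 28), datetime(2005, 8, 28, 0, 5), (-92.0, 24.0, -85.0, 30.0)),
]

selected = filter_granules(granules, curated_datasets, event_start, event_end, event_bbox)
print(len(selected))
```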


Overview 

[Diagram: Relevancy Ranking; Granules, each listing Data Variable 1, 
Data Variable 2, Data Variable 3; Relevant data variables] 

Data Curation Algorithm 
Approaches 


• Text mining 

o Pros: Don’t need to explicitly 
define the phenomena 

o Cons: Dependent on the 
truth set; catalog is 
dynamic and new data 
may never get classified 

• Ontology Based 

o Pros: Best precision and 
recall 

o Cons: Labor intensive to 
build an explicit model and 
map to instances 

• Information Retrieval 

o Boolean (Faceted) Search 

• Pros: Simple to implement 

• Cons: Phenomena can be 
complex; user may not 
know all the right keywords 

o Relevancy Ranking Algorithm 

• Pros: Lists most relevant 
data first 

• Cons: Requires a custom 
algorithm 

Data Search for Earth Science 
Phenomena 

[Diagram: finding all data sets useful in studying “Hurricane”; 
bypass this step for the end user. Questions posed: 
“How to define a ...”, “How to automatically ...”, 
“Best relevancy ranking algorithm?”] 

Relevancy Ranking 

• Search: Curation problem 

• Data curation: Relevancy ranking service for 
a set of Earth science phenomena 


Relevancy Ranking: 

Initial Exploratory Experiments 

• Approaches tested: 

o O-Rank (Top down approach) 
o Wikipedia Terms 
o Manual Terms Experiment 
o Latent Semantic Index - Dual Set terms 
o Metadata based Ranking 

• Key Takeaways: 

o Best results: three approaches where terms describing the 
phenomenon were manually constructed after exploring 
metadata records 

o Both ontology and automated term construction (Wiki) 
approaches don’t map well to metadata terms/descriptions 

Follow-on Experiments: 
Approach 

• Defining a phenomenon 

o Expert selects a “bag of words” to 
define a phenomenon 

o Controlled vocabulary (GCMD) is used 
for the “words” 

• Best relevancy ranking algorithm? 

o Use well-known algorithms: 
Jaccard coefficient, 
cosine similarity, 
zone ranking 
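
The three measures named above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the project's implementation: the tokenizer is deliberately naive, "zone ranking" is read here as textbook weighted zone scoring over metadata fields, and the metadata record, its field names and the phenomenon terms are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a: List[str], b: List[str]) -> float:
    """Cosine similarity of raw term-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def zone_score(query: Set[str], zones: Dict[str, str],
               weights: Dict[str, float]) -> float:
    """Weighted zone scoring: add a zone's weight if any query term occurs in it."""
    return sum(
        weight
        for zone, weight in weights.items()
        if query & set(tokenize(zones.get(zone, "")))
    )


# Hypothetical phenomenon definition and metadata record.
hurricane_terms = tokenize("hurricanes wind speed precipitation")
record = {
    "name": "Example Ocean Wind Speed Product",
    "description": "Daily wind speed and precipitation estimates",
    "keywords": "ocean winds precipitation",
}
keyword_terms = tokenize(record["keywords"])

print(jaccard(set(hurricane_terms), set(keyword_terms)))
print(cosine(hurricane_terms, keyword_terms))
print(zone_score(set(hurricane_terms), record, {"name": 0.5, "description": 0.5}))
```

Without stemming, "wind" and "winds" do not match, which is one reason a controlled vocabulary helps on both sides of the comparison.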

Assumptions/Observations 


• Metadata quality 

o Richness 
o Vocabulary 
o Tags 

• Earth Science Phenomena can be defined 
using a bag of keywords 



Experiment Setup 

• Datasets: 200 randomly selected from ECHO 

• Binary labeling: relevancy to phenomena 
(label = majority of 3 scientists) 

• Manually selected set of GCMD Science Keywords 
relevant to phenomena 
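
Given binary labels from three raters, the truth label and the precision over the top returns can be computed as below. This is a sketch only: the function names, the dataset identifiers and the votes are hypothetical, and precision is taken as the fraction of the top-k returned datasets that are labeled relevant.

```python
from typing import Dict, Sequence


def majority_label(votes: Sequence[bool]) -> bool:
    """True when more than half of the raters marked the dataset relevant."""
    return sum(votes) * 2 > len(votes)


def precision_at_k(ranked_ids: Sequence[str], labels: Dict[str, bool],
                   k: int = 20) -> float:
    """Fraction of the top-k returned datasets whose label is relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(labels.get(dataset_id, False) for dataset_id in top) / len(top)


# Hypothetical votes from three scientists for four datasets.
votes = {
    "ds1": [True, True, False],
    "ds2": [False, False, True],
    "ds3": [True, True, True],
    "ds4": [False, False, False],
}
labels = {dataset_id: majority_label(v) for dataset_id, v in votes.items()}

# Hypothetical ranking returned by a relevancy algorithm.
ranking = ["ds3", "ds2", "ds1", "ds4"]
print(precision_at_k(ranking, labels, k=2))
```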

Precision 

Top 20 returns (Hurricane) 

Hurricane top 20 returns (Jac: 30%, Cos: 30%, Name: 20%, Des: 20%) 

Hurricane top 20 returns (Jac: 17%, Cos: 17%, Name: 33%, Des: 33%) 

Next: Find relevant data fields 

• Dataset is relevant 

o now what? 

o how do I use the granules for the dataset? 

• Need actual data variable name 

o for example: Giovanni uses these fields for visualization 

• What we know 

o relevant science keywords (GCMD) - Experts 
o granule data fields and metadata - Auto extract* 

• How do we map? 

o manually? May work for few datasets only 
• Hundreds of data variables per granule 
o start with GCMD to CF Standard name 
o most don't follow CF Standard names 


Approach 

Two parallel branches, one starting from the dataset and one from its granules: 

• Dataset: Extract Science Keywords 
• Granules: Extract Variables and Descriptions (OPeNDAP, netCDF libs, ...) 

Each branch then passes through the same steps: 

• Text processing: remove special characters, tokenize 
• Look-up table: acronym/abbreviation expansion, CF 
• Normalization: remove stopwords / stem / lemmatize 
• Bag-of-words 

The two bags-of-words are then combined: 

• Intersection 
• NLP: learn patterns 
• Suggest Keywords 
• Assess Metadata 
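
The text-processing and intersection steps can be sketched as follows. This is a simplified illustration, not the project's pipeline: the stopword list and the acronym look-up table are small hypothetical stand-ins, stemming and lemmatization are omitted, and the score is assumed to be the number of shared words between the two bags.

```python
import re
from typing import List, Set, Tuple

# Illustrative stand-ins; a real pipeline would use fuller resources.
STOPWORDS = {"a", "an", "and", "at", "for", "of", "the", "to"}
ACRONYMS = {"temp": "temperature", "relh": "relative humidity"}  # hypothetical


def bag_of_words(text: str) -> Set[str]:
    """Strip special characters, tokenize, expand acronyms, drop stopwords."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", text).lower()
    words: List[str] = []
    for token in cleaned.split():
        words.extend(ACRONYMS.get(token, token).split())
    return {w for w in words if w not in STOPWORDS}


def match_score(keyword: str, description: str) -> int:
    """Number of words shared by a science keyword and a variable description."""
    return len(bag_of_words(keyword) & bag_of_words(description))


def rank_keywords(description: str, keywords: List[str]) -> List[Tuple[int, str]]:
    """Keywords with a non-zero score, highest score first."""
    scored = [(match_score(k, description), k) for k in keywords]
    return sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])


keywords = [
    "ATMOSPHERE->AEROSOLS->AEROSOL OPTICAL DEPTH/THICKNESS",
    "ATMOSPHERE->CLOUDS->CLOUD OPTICAL DEPTH/THICKNESS",
]

for score, keyword in rank_keywords("Cloud Optical Depth at 532 nm", keywords):
    print(score, keyword)
print(rank_keywords("Surface Relative Humidity", keywords))
```

Counting shared words keeps the score easy to interpret, at the cost of treating every word as equally informative.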

Example: GLAS/ICESat L2 Global Thin Cloud/Aerosol 
Optical Depths Data (HDF5) V033 - Dataset Metadata 

GLAS/ICESat L2 Global Thin Cloud/Aerosol Optical Depths 
Data (HDF5) V033 version 33 

John.P.Dimarzio.1@nasa.gov 

ICESat Science Investigator-led Processing System (I-SIPS) 
757-864-1238 (phone) 

David.W.Hancock@nasa.gov 

NASA DAAC at the National Snow and Ice Data Center 
303-492-6199 (phone) 
303-492-2468 (fax) 
nsidc@nsidc.org 

Science Keywords: 

Earth Science -> Atmosphere -> Clouds 
Earth Science -> Atmosphere -> Aerosols 

Example: GLAS/ICESat L2 Global Thin Cloud/Aerosol 
Optical Depths Data (HDF5) V033 

Sample file: GLAH11.033/2006.10.25/ 
GLAH11_633_2117_001_1275_0_01_0001.H5 

Data Variables (as listed in HDFView 2.10.1): 

ANCILLARY_DATA 
BROWSE 
Data_1HZ 
Angle 
DS_Cloud_Layer_10 
DS_UTCTime_1 
Flags 
Geolocation 
Geophysical 
r_Surface_pres 
r_Surface_relh 
r_Surface_temp 
r_Surface_wdir 
r_Surface_wind 
r_cld1_grd_det 
OD1064CloudLayers 
OD532CloudLayer 
i_cld1_qf 
i_cld1_uf 
r_MRg_cldbot_pres 
r_MRg_cldbot_relh 
r_MRg_cldbot_temp 
r_MRg_cldtop_pres 
r_MRg_cldtop_relh 
r_MRg_cldtop_temp 
r_cld1_bot 
r_cld1_msf 
r_cld1_od 
r_cld1_top 




Example: GLAS/ICESat L2 Global Thin Cloud/Aerosol Optical 
Depths Data (HDF5) V033 

Science keyword to variable mapping 

• r_Surface_relh | Surface Relative Humidity 

o No match 

• r_Surface_temp | Surface Temperature 

o No match 

• r_Surface_wind | Surface Wind Speed 

o No match 

• r_cld1_od | Cloud Optical Depth at 532 nm 

o Score=3 keyword: ATMOSPHERE->CLOUDS->CLOUD OPTICAL DEPTH/THICKNESS 
o Score=2 keyword: ATMOSPHERE->AEROSOLS->AEROSOL OPTICAL DEPTH/THICKNESS 


Variable to keyword mapping 

• ATMOSPHERE->CLOUDS->CLOUD OPTICAL DEPTH/THICKNESS 
o Score=3 name: r_cld_ir_OD | Cloud Optical Depth at 1064 nm 
o Score=3 name: i_cld1_qf | Cloud optical depth flag for 532 nm 
o Score=3 name: i_cld1_uf | Cloud optical depth flag for 532 nm 
o Score=3 name: r_cld1_od | Cloud Optical Depth at 532 nm 
o more with low scores 

This approach can be used to assess metadata quality and 
also suggest keyword annotation! 


Part 3: Rules Engine 


What settings should I use to visualize 
this event? 

[Figure: sample plot (NLDAS_FORA025_M.002 anomaly of precipitation monthly total [kg/m^2], Jan 1979; generated 2013-03-12 17:00:43 GMT by NASA GES DISC) with the questions a user faces: Dataset? Data Variable? Visualization Type?] 

Goal: Automate data preprocessing and exploratory analysis and visualization tasks 

Images from: http://globe-views.com/dcim/dreams/volcano/volcano-03.jpg, http://grecaira.users37.interdns.co.uk/essay/images/confused.png, http://disc.sci.gsfc.nasa.gov/datareleases/images/nldas_monthly_climatology_figure_9.gif 


Strategy 

• Service to generate and rank candidate workflow 
configurations 

• Use rules to make assertions about compatibility based on 
multiple factors 

o does this data variable make sense for this feature? 

o does this visualization type make sense for this feature? 

o does the temporal / spatial resolution of this dataset make sense 
for this feature? 

• Each compatibility assertion type is assigned a weight. 

o ex: Strong = 5, Some = 3, Slight = 1, Indifferent = 0, Negative = -1. 

• Based on the aggregated compatibility assertions, we 
calculate the score for each visualization candidate. 
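
A minimal sketch of this scoring step follows. The weights are the ones listed above; the rest is an assumption for illustration: aggregation is taken to be a plain sum of the weights, and the candidate names and their assertion levels are hypothetical.

```python
from typing import Dict, List, Tuple

# Weights as listed in the strategy.
WEIGHTS = {"Strong": 5, "Some": 3, "Slight": 1, "Indifferent": 0, "Negative": -1}


def score_candidate(assertions: List[str]) -> int:
    """Aggregate compatibility assertions by summing their weights (assumed)."""
    return sum(WEIGHTS[level] for level in assertions)


def rank_candidates(candidates: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Return (candidate, score) pairs, highest score first."""
    scored = [(name, score_candidate(levels)) for name, levels in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates; each list holds the assertions for variable fit,
# visualization fit and resolution fit, in that order.
candidates = {
    "time-averaged map": ["Strong", "Some", "Slight"],
    "area-averaged time series": ["Strong", "Strong", "Some"],
    "vertical profile": ["Indifferent", "Negative", "Slight"],
}

for name, score in rank_candidates(candidates):
    print(f"{score:>3}  {name}")
```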


Phenomena Feature Characteristic 
Mappings 

Characteristic \ Phenomena          Volcano - Ash Plume   Flood    Dust Storm 
East-West Movement                  Indifferent           Some     Strong 
North-South Movement                Indifferent           Some     Strong 
Temporal Evolution                  Strong                Strong   Strong 
Spatial Extent of Event             Slight                Some     Strong 
Year-to-Year Variability            Strong                Some     Indifferent 
May Impact Seasonal Variation       Strong                Strong   Indifferent 
Variation with Atmospheric Height   Strong                Some     Strong 
Global Phenomena                    Strong                Slight   Indifferent 
Detection of Events                 Strong                Some     Some 


Service to Characteristic Mappings 

Compute Compatibility 

Phenomena: 

Images from: http://i.dailymail.co.uk/i/pix/2010/05/18/article-1279221-09A0CCC4000005DC-385_634x433.jpg, http://disc.sci.gsfc.nasa.gov/datareleases/images/nldas_monthly_climatology_figure_9.gif, http://www.clipartbest.com/cliparts/biy/bAX/biybAXGiL.png 




Next Steps 

• Generate rules for compatibility assertions 
based on 

o data variables 
o temporal / spatial resolution 
o dataset processing 

• Explore additional strategies for making 
compatibility assertions 


Part 4: Image Retrieval 

• • • 

Initial Results 


Image Retrieval 

• Goal: given an image of an Earth science 
phenomenon, retrieve similar images 


• Challenge: “semantic gap” 

o gap between low-level image pixels and high-level 
semantic concepts perceived by humans 

Image retrieval approaches 

• Traditional approaches 

o Image features: Color, Texture, Edge histogram... 
o “Shallow” architecture 
o User defines the feature 
o Preliminary experiments 

• State of the Art approach 

o Generic 

o No need for domain expert 

Deep Learning 

• Mimics the human brain that is organized in a deep 
architecture 

• Processes information through multiple stages of 
transformation and representation 

• Learns complex functions that directly map pixels to 
the output, without relying on human-crafted 
features 

Source: Google Research, CVPR 2014 


Experiment Setup 

• NASA rapid response MODIS imagery 

• 600 images 

• 3 phenomena - Hurricane, Dust, Smoke/ 
Haze 

• Train a Convolutional Neural Network on half 
of the images 

• Test 

Sample rapid response images 



Hurricane Smoke Dust 

4 layers 

- Number of filters in each layer = 100, 200, 400, 800 

- Convolved and pooled on every layer 

- Overall accuracy approximately equal to that of 5 layers (slightly better 
than 6 layers) 


Error Matrix 

4 layers, learning rate = 0.003 

True\Pred     Others   Dust   Haze/Smoke   Hurricane 
Others        173      20     76           7 
Dust          29       128    23           0 
Haze/Smoke    79       3      207          7 
Hurricane     30       2      3            163 

Accuracy Numbers 

Producer's Accuracy 

• Other: 173/311 = 55.6% 
• Dust: 128/153 = 83.7% 
• Smoke: 207/309 = 67% 
• Hurricane: 163/177 = 92.1% 

User's Accuracy 

• Other: 173/276 = 62.7% 
• Dust: 128/180 = 71.1% 
• Smoke: 207/296 = 69.9% 
• Hurricane: 163/198 = 82.3% 

Overall accuracy - 70.6% 
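
The accuracy figures can be recomputed from the error matrix as a check. In this sketch the two middle entries of the Hurricane row (2 and 3) are an assumption: they are inferred from the Dust and Smoke denominators 153 and 309 reported above. With those values, the denominators reported under Producer's Accuracy equal the column totals of the matrix and those under User's Accuracy equal the row totals. The function names are illustrative.

```python
from typing import List

CLASSES = ["Others", "Dust", "Haze/Smoke", "Hurricane"]

# Rows and columns follow the error matrix as printed (True\Pred).
# ASSUMPTION: the Hurricane-row entries 2 and 3 are inferred from the
# reported denominators 153 (Dust) and 309 (Smoke).
MATRIX = [
    [173, 20, 76, 7],
    [29, 128, 23, 0],
    [79, 3, 207, 7],
    [30, 2, 3, 163],
]


def row_totals(matrix: List[List[int]]) -> List[int]:
    """Sum of each row."""
    return [sum(row) for row in matrix]


def column_totals(matrix: List[List[int]]) -> List[int]:
    """Sum of each column."""
    return [sum(col) for col in zip(*matrix)]


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Diagonal (correct) count divided by the total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(row_totals(matrix))


rows, cols = row_totals(MATRIX), column_totals(MATRIX)
for i, name in enumerate(CLASSES):
    hit = MATRIX[i][i]
    print(f"{name}: {hit}/{cols[i]} = {hit / cols[i]:.1%}, "
          f"{hit}/{rows[i]} = {hit / rows[i]:.1%}")
print(f"Overall: {overall_accuracy(MATRIX):.1%}")
```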


Summary 

• Build three specific semantic middleware core 
components 

o Image retrieval service - uses browse imagery to enable 
discovery of possible new case studies and also presents 
exploratory analytics. 

o Data curation service - uses metadata and textual 
descriptions to find relevant data sets and granules needed 
to support the analysis of a phenomenon or a topic. 

o Semantic rules engine - automates data preprocessing 
and exploratory analysis and visualization tasks. 

• Explore pathways to infuse these components into 
existing NASA information and data systems 